1 Data description

In this tutorial, we aim to obtain differentially expressed genes between two biological conditions in an integrated data. This integrated data set is generated by combining 10 microarray studies on control and TGFb-treated dataset (for more information, see here http://mcr.aacrjournals.org/content/15/5/619). 

First, we assess the presence of unwanted variation in the integrated data which combines differents studies and platforms. Second, we show how to use two RUV methods: RUVinv and RUV4 to remove unwanted variation and detect differentially expressed genes comparing control dataset to TGFb-treated dataset. Third, we assess whether RUV methods are useful and compare the results obtained by each method.

library(ruv)            ## for applying RUV methods
library(limma)          ## for vennDiagram()
library(ggplot2)        ## for data visualisation

The integrated data introduced above have been split into two data sets: dataset A and dataset B, each containing different studies, platforms and tissues. We will explore and normalise these two data sets separately in order to compare the results obtained by the two normalisation methods (RUV4 and RUVinv).

Read in the two integrated dataset.

datasetA<- read.table("expr_10data_datasetA.txt", header = T, sep = "\t")
datasetB<- read.table("expr_10data_datasetB.txt", header = T, sep = "\t")

Look at the data in each dataset A and B, then make a matrix for each these data where the row names are gene IDs.

head(datasetA,3)
head(datasetB,3)
mA<-datasetA[,2:dim(datasetA)[2]]
mB<-datasetB[,2:dim(datasetB)[2]]
row.names(mA)<- datasetA[,1]
row.names(mB)<- datasetB[,1]
mA<- as.matrix(mA)
mB<- as.matrix(mB)

Look at the information related to each dataset including the name of the studies, types of platform, treatment, and tissue:

info_datasetA<- read.table("info_10data_datasetA.txt", sep="\t", header=T)
info_datasetA
info_datasetB<- read.table("info_10data_datasetB.txt", sep="\t", header=T)
info_datasetB

2 Assessment of unwanted variation in the data

Here we perform some exploratory analysis on the integrated data to assess the presence of unwanted variation in each dataset.

2.1 RLE plot

We start by looking at the RLE plots in dataset A data, coloured by study, platform and tissue.For more information about RLE plot see reference paper RLE Plots: Visualising Unwanted Variation in High Dimensional Data, Luke C. Gandolfo, Terence P. Speed 2017.

## Transpose the expression matrix so that we have genes in columns and dataset in rows
YA <- t(mA)
## Plot RLE coloured by study
ruv_rle(YA, info_datasetA$study, ylim = c(-4,4))

## Plot RLE coloured by platform
ruv_rle(YA, info_datasetA$platform, ylim = c(-4,4))

## Plot RLE coloured by platform
ruv_rle(YA, info_datasetA$tissue, ylim = c(-4,4))

Similarly, we look at the RLE plots in dataset B data coloured by study, platform and tissue:

YB <- t(mB)
## Plot RLE coloured by study
ruv_rle(YB, info_datasetB$study, ylim = c(-4,4))

## Plot RLE coloured by platform
ruv_rle(YB, info_datasetB$platform, ylim = c(-4,4))

## Plot RLE coloured by tissue
ruv_rle(YB, info_datasetB$tissue, ylim = c(-4,4))

2.2 PCA plot

In transcriptomics applications, one of the most utilized exploratory plots is the multi-dimensional scaling (MDS) plot or a principal component analysis (PCA) plot. To assess the presence of unwanted variation in each dataset, we use PCA plots to show similarities between dataset measured in an unsupervised way. Ideally, dataset should cluster together according to the treatment (i.e. the biological factor of interest). Here, we see that dataset are rather clustered by studies (i.e. unwanted variation) in both dataset A and B data.
In the current example, it’s important to note that in some cases, different studies are confounded with different platforms and tissues, and therefore there is no way to identify how much of the unwanted variation come from each of these factors. Such situations must be avoided when designing an experiment and for the purpose of this tutorial, we only consider “study” as the source of unwanted variation.

gg_additions <- list(aes(color = info_datasetA$study, 
                         shape = info_datasetA$treatment, 
                         size = 5, alpha = .7),
                     labs(color = "Study", shape="Treatment"),
                     scale_size_identity(guide = "none"),
                     scale_alpha(guide = "none"),
                     theme(legend.text = element_text(size = 12),
                           legend.title = element_text(size = 16)),
                     guides(color = guide_legend(override.aes = list(size = 4)),
                            shape = guide_legend(override.aes = list(size = 4))))
options(repr.plot.width = 8, repr.plot.height = 6)
ruv_svdplot(YA) + gg_additions 

gg_additions <- list(aes(color = info_datasetB$study,
                         shape = info_datasetB$treatment,
                         size = 5, alpha = .7),
                     labs(color = "Study",shape = "Treatment"),
                     scale_size_identity(guide = "none"),
                     scale_alpha(guide = "none"),
                     theme(legend.text = element_text(size = 12),
                           legend.title = element_text(size = 16)),
                     guides(color = guide_legend(override.aes = list(size = 4)),
                            shape = guide_legend(override.aes = list(size = 4))))
options(repr.plot.width=8, repr.plot.height=6)
ruv_svdplot(YB) + gg_additions #

3 Remove batch effects using RUV methods

There are several RUV methods for removing unwanted variation in order to obtain DEGs; these include RUV-2, RUV-4, RUV-inv and RUV-rinv. In general, RUV methods are dependent on negative control genes (genes which are not associated with the biological factor of interest) and replicate dataset (if applicable). RUV-2 removes unwanted variation in two steps. RUV-4 came after RUV-2 and has four steps. For RUV-4 we can estimate the dimension of unwanted variation (k) or select different values for k, while for RUV-inv and RUV-rinv we don’t need to estimate k as it is set to be the maximum value. In general, RUV-inv and RUV-rinv are better than RUV-4. RUV-inv is recommended when we have large number of control genes (~1000), while RUV-rinv is more appropriate with small number of control genes (~60).
Selection of negative control genes is very important. Examples of negative control genes are the spike-in controls or the housekeeping (HK) genes. It is also possible to define empirical negative control genes using an iterative approach. However, it is only recomended if (i) the initial negative control genes are not very good or there are only a few of them; (ii) the beta seems to be very sparse. Therefore, the user needs to generate diagnostic plots to assess the performance of the initial analysis, and only if needed, use an iterative approach to define better control genes.

3.1 RUV4

RUV4 can be run with different values of k in an attempt to find the optimal value. Note that although there is a function to estimate k, called getK(), this may not give the optimal value for k and is often recomended to be used when there is no other choice but to automate finding K (e.g. in simulations).

We begin with the list of housekeeping (HK) genes as our negative controls.

HKgenes <- read.table("HouseKeeping_genes_IDs.txt", header=T, sep="\t")
hk <- HKgenes$GeneID
ctrl <- colnames(YA) %in% hk

3.1.1 Apply RUV4 using house-keeping genes

## Take treatment as the biological factor of interest
groups_A <- factor(info_datasetA$treatment) 
gA <- cbind(as.numeric(groups_A))   ## 1 control, 2: TGFb
## Take treatment as the biological factor of interest
groups_B <- factor(info_datasetB$treatment) 
gB <- cbind(as.numeric(groups_B))   ## 1 control, 2: TGFb
ks <- c(1, 2, 5, 6, 7, 8, 10, 11, 12, 15, 18, 20, 22, 23, 24)
## For k > 24 I got Error:
# NaNs producedNaNs producedError in sigmashrink(fit$sigma2, fit$df) : 
#   NA/NaN/Inf in foreign function call (arg 1)
beta_corAB_HK <- vector()
fit_ruv4_hk_datasetA_all_k=list()
fit_ruv4_hk_datasetA_all_k.summary=list()
fit_ruv4_hk_datasetB_all_k=list()
fit_ruv4_hk_datasetB_all_k.summary=list()
for (K in ks){
  fit_ruv4_hk_datasetA_all_k[[K]] <- RUV4(YA, X = gA, 
                            ctl = ctrl, 
                            k = K,Z = 1, eta = NULL, 
                            fullW0 = NULL, inputcheck = TRUE)
  
  fit_ruv4_hk_datasetA_all_k.summary[[K]] <- ruv_summary(YA,
                                           fit_ruv4_hk_datasetA_all_k[[K]],
                                           info_datasetA)
  
  fit_ruv4_hk_datasetB_all_k[[K]] <- RUV4(YB, X = gB, 
                            ctl = ctrl, 
                            k = K,Z = 1, eta = NULL, 
                            fullW0 = NULL, inputcheck = TRUE)
  
  fit_ruv4_hk_datasetB_all_k.summary[[K]] <- ruv_summary(YB,
                                           fit_ruv4_hk_datasetB_all_k[[K]],
                                           info_datasetB)
  
  currentCor <- cor.test(fit_ruv4_hk_datasetA_all_k[[K]]$betahat,
                        fit_ruv4_hk_datasetB_all_k[[K]]$betahat)$estimate
  
  beta_corAB_HK <- c(beta_corAB_HK, currentCor)
  
}
names(beta_corAB_HK) <- ks
data.frame(beta_corAB_HK) ## K = 23 seems a good choice

4 Comparison of results of unadjusted and RUV4-adjusted data

In order to see if we have helped or not, we run differential expression analysis on the unadjusted data. Then we compare the results of RUV4 with different K values and unadjusted together.

4.1 Unadjusted data

We can run RUV4 with k = 0 to do no adjustment when obtaining DEGs.

4.1.1 Apply DE analysis in the unadjusted data using HK genes

# RUV4 with k = 0 for no adjustment
# Equivalent to a Limma Analysis without considering the batch term
##----- In dataset A data
fit_unadj_hk_datasetA <- RUV4(YA, X = gA, 
                          ctl = ctrl, 
                          k = 0)
fit_unadj_hk_datasetA.summary <- ruv_summary(YA, 
                                        fit_unadj_hk_datasetA,
                                        info_datasetA)
##----- In dataset B data
fit_unadj_hk_datasetB <- RUV4(YB, X = gB, 
                          ctl = ctrl,
                          k = 0)
fit_unadj_hk_datasetB.summary <- ruv_summary(YB, 
                                         fit_unadj_hk_datasetB,
                                         info_datasetB)

4.2 P-values distribution

We compare the results obtained from unajusted, RUVinv- and RUV4- adjusted data using p-value distributions, correlations of beta values and venn diagrams in dataset A and B data.

K=23
ruv_hist(fit_unadj_hk_datasetA.summary) + ggtitle("Unadj_A")

ruv_hist(fit_ruv4_hk_datasetA_all_k.summary[[K]]) + ggtitle("RUV4_HK_A")

ruv_hist(fit_unadj_hk_datasetB.summary) + ggtitle("Unadj_B")

ruv_hist(fit_ruv4_hk_datasetB_all_k.summary[[K]]) + ggtitle("RUV4_HK_B")

4.3 Betahat correlation

For each of the unadjusted, RUVinv- and RUV4- adjusted settings, we can compare the results between dataset A and B data sets to see which method gives more consistent resulst in these two data sets. To quantify these consistencies, we look at the correlations between betahat from dataset A and betahat from dataset B for the unadjusted method, RUVinv and RUV4.

##------ Unadjusted data sets
plot(fit_unadj_hk_datasetA$betahat, 
     fit_unadj_hk_datasetB$betahat,
     xlab = "Betahat dataset A",
     ylab = "Betahat dataset B",
     main = "Unadjusted",
     xlim = c(-3,3), cex = 0.3, ylim = c(-4,4))
corVal <- cor.test(fit_unadj_hk_datasetA$betahat, fit_unadj_hk_datasetB$betahat)$estimate
text(-3,3, pos = 4, paste("Correlation: ", round(corVal,2), sep = ""))

for (K in ks){
   
  #------- RUV4 adjusted data sets
  plot(fit_ruv4_hk_datasetA_all_k[[K]]$betahat,
     fit_ruv4_hk_datasetB_all_k[[K]]$betahat,
     xlab = "Betahat dataset A",
     ylab = "Betahat dataset B",
     main = paste("RUV4_K_",K,sep=""),
     xlim = c(-3,3), cex = 0.3, ylim = c(-4,4))
  #abline(fit_ruv4_emp_datasetB$betahat,fit_ruv4_emp_datasetA$betahat)
  corVal <- cor.test(fit_ruv4_hk_datasetA_all_k[[K]]$betahat, fit_ruv4_hk_datasetB_all_k[[K]]$betahat)$estimate
  text(-3,3, pos=4, paste("Correlation: ", round(corVal,2),sep=""))
}

4.4 Overlap between differentially expressed genes

Finally, for each of the unadjusted, RUVinv and RUV4, we look at the number of overlapping differentially expressed genes (DEGs) between the two dataset A and B data sets. We also check for consistency between methods on the same data (dataset A or dataset B data set).
First, we define DEGs as genes with adjusted p-value < 0.05 and |logFC| > 1 in dataset A and B data sets which are unadjusted, RUV4- or RUVinv-adjusted.

##------ In dataset A data
DEGsUnadj_datasetA <- row.names(fit_unadj_hk_datasetA.summary$C)[fit_unadj_hk_datasetA.summary$C$F.p.BH < 0.05 & abs(fit_unadj_hk_datasetA.summary$C$b_X1) > 1]
DEGsRUV4_datasetA <- row.names(fit_ruv4_hk_datasetA.summary$C)[fit_ruv4_hk_datasetA.summary$C$F.p.BH < 0.05 & abs(fit_ruv4_hk_datasetA.summary$C$b_X1) > 1]  
Error in row.names(fit_ruv4_hk_datasetA.summary$C) : 
  object 'fit_ruv4_hk_datasetA.summary' not found

4.4.1 DEGs in the unadjusted data

Venn diagram comparing the unadjusted dataset A and B data.

4.4.2 DEGs in the RUV4-adjusted data for different k

Venn diagram comparing the RUV4-adjusted dataset A and B data.

4.4.3 Compare the unadjusted and RUV4 in dataset A data

4.4.4 Compare the unadjusted and RUV4 in dataset B data

---
title: "How to remove unwanted variation from transcriptomics data using RUVinv and RUV4"
author: Sepideh Foroutan and Marie Trussart
output:
  github_document:
    toc: yes
  html_notebook:
    number_sections: yes
    toc: yes
    toc_float: yes
    code_folding: "hide"
    fig_caption: yes
---


# Data description
In this tutorial, we aim to obtain differentially expressed genes between two biological conditions in an integrated data. This integrated data set is generated by combining 10 microarray studies on control and TGFb-treated dataset (for more information, see here http://mcr.aacrjournals.org/content/15/5/619).\ 

First, we assess the presence of unwanted variation in the integrated data which combines differents studies and platforms. Second, we show how to use two RUV methods: RUVinv and RUV4 to remove unwanted variation and detect differentially expressed genes comparing control dataset to TGFb-treated dataset. Third, we assess whether RUV methods are useful and compare the results obtained by each method.
```{r load-libs}
library(ruv)            ## for applying RUV methods
library(limma)          ## for vennDiagram()
library(ggplot2)        ## for data visualisation
```

The integrated data introduced above have been split into two data sets: dataset A and dataset B, each containing different studies, platforms and tissues. We will explore and normalise these two data sets separately in order to compare the results obtained by the two normalisation methods (RUV4 and RUVinv).

Read in the two integrated dataset.
```{r load-data}
datasetA<- read.table("expr_10data_datasetA.txt", header = T, sep = "\t")
datasetB<- read.table("expr_10data_datasetB.txt", header = T, sep = "\t")
```

Look at the data in each dataset A and B, then make a matrix for each these data where the row names are gene IDs.
```{r explore-data}

head(datasetA,3)
head(datasetB,3)
mA<-datasetA[,2:dim(datasetA)[2]]
mB<-datasetB[,2:dim(datasetB)[2]]
row.names(mA)<- datasetA[,1]
row.names(mB)<- datasetB[,1]
mA<- as.matrix(mA)
mB<- as.matrix(mB)
```

Look at the information related to each dataset including the name of the studies, types of platform, treatment, and tissue:
```{r load-info-study}
info_datasetA<- read.table("info_10data_datasetA.txt", sep="\t", header=T)
info_datasetA
info_datasetB<- read.table("info_10data_datasetB.txt", sep="\t", header=T)
info_datasetB
```

# Assessment of unwanted variation in the data
Here we perform some exploratory analysis on the integrated data to assess the presence of unwanted variation in each dataset.

## RLE plot
We start by looking at the RLE plots in dataset A data, coloured by study, platform and tissue.For more information about RLE plot see reference paper [RLE Plots: Visualising Unwanted Variation in High Dimensional Data](https://arxiv.org/abs/1704.03590), Luke C. Gandolfo, Terence P. Speed 2017.

```{r rleplots-datasetA}
## Transpose the expression matrix so that we have genes in columns and dataset in rows
YA <- t(mA)
## Plot RLE coloured by study
ruv_rle(YA, info_datasetA$study, ylim = c(-4,4))
## Plot RLE coloured by platform
ruv_rle(YA, info_datasetA$platform, ylim = c(-4,4))
## Plot RLE coloured by platform
ruv_rle(YA, info_datasetA$tissue, ylim = c(-4,4))
```

Similarly, we look at the RLE plots in dataset B data coloured by study, platform and tissue:
```{r rleplots-datasetB}
YB <- t(mB)
## Plot RLE coloured by study
ruv_rle(YB, info_datasetB$study, ylim = c(-4,4))
## Plot RLE coloured by platform
ruv_rle(YB, info_datasetB$platform, ylim = c(-4,4))
## Plot RLE coloured by tissue
ruv_rle(YB, info_datasetB$tissue, ylim = c(-4,4))
```

## PCA plot
In transcriptomics applications, one of the most utilized exploratory plots is the multi-dimensional scaling (MDS) plot or a principal component analysis (PCA) plot. To assess the presence of unwanted variation in each dataset, we use PCA plots to show similarities between dataset measured in an unsupervised way. Ideally, dataset should cluster together according to the treatment (i.e. the biological factor of interest). Here, we see that dataset are rather clustered by studies (i.e. unwanted variation) in both dataset A and B data.\
In the current example, it's important to note that in some cases, different studies are confounded with different platforms and tissues, and therefore there is no way to identify how much of the unwanted variation come from each of these factors. Such situations must be avoided when designing an experiment and for the purpose of this tutorial, we only consider "study" as the source of unwanted variation.\

```{r mds-plot-datasetA}
gg_additions <- list(aes(color = info_datasetA$study, 
                         shape = info_datasetA$treatment, 
                         size = 5, alpha = .7),
                     labs(color = "Study", shape="Treatment"),
                     scale_size_identity(guide = "none"),
                     scale_alpha(guide = "none"),
                     theme(legend.text = element_text(size = 12),
                           legend.title = element_text(size = 16)),
                     guides(color = guide_legend(override.aes = list(size = 4)),
                            shape = guide_legend(override.aes = list(size = 4))))
options(repr.plot.width = 8, repr.plot.height = 6)
ruv_svdplot(YA) + gg_additions 
```

```{r mds-plot-datasetB}
gg_additions <- list(aes(color = info_datasetB$study,
                         shape = info_datasetB$treatment,
                         size = 5, alpha = .7),
                     labs(color = "Study",shape = "Treatment"),
                     scale_size_identity(guide = "none"),
                     scale_alpha(guide = "none"),
                     theme(legend.text = element_text(size = 12),
                           legend.title = element_text(size = 16)),
                     guides(color = guide_legend(override.aes = list(size = 4)),
                            shape = guide_legend(override.aes = list(size = 4))))
options(repr.plot.width=8, repr.plot.height=6)
ruv_svdplot(YB) + gg_additions #
```


# Remove batch effects using RUV methods
There are several RUV methods for removing unwanted variation in order to obtain DEGs; these include RUV-2, RUV-4, RUV-inv and RUV-rinv. In general, RUV methods are dependent on negative control genes (genes which are not associated with the biological factor of interest) and replicate dataset (if applicable). RUV-2 removes unwanted variation in two steps. RUV-4 came after RUV-2 and has four steps. For RUV-4 we can estimate the dimension of unwanted variation (k) or select different values for k, while for RUV-inv and RUV-rinv we don't need to estimate k as it is set to be the maximum value. In general, RUV-inv and RUV-rinv are better than RUV-4. RUV-inv is recommended when we have large number of control genes (~1000), while RUV-rinv is more appropriate with small number of control genes (~60).\

Selection of negative control genes is very important. Examples of negative control genes are the spike-in controls or the housekeeping (HK) genes. It is also possible to define empirical negative control genes using an iterative approach. However, it is only recomended if (i) the initial negative control genes are not very good or there are only a few of them; (ii) the beta seems to be very sparse. Therefore, the user needs to generate diagnostic plots to assess the performance of the initial analysis, and only if needed, use an iterative approach to define better control genes.


## RUV4
RUV4 can be run with different values of k in an attempt to find the optimal value. Note that although there is a function to estimate k, called **getK()**, this may not give the optimal value for k and is often recomended to be used when there is no other choice but to automate finding K (e.g. in simulations). 

We begin with the list of housekeeping (HK) genes as our negative controls.
```{r}
HKgenes <- read.table("HouseKeeping_genes_IDs.txt", header=T, sep="\t")
hk <- HKgenes$GeneID
ctrl <- colnames(YA) %in% hk
```


### Apply RUV4 using house-keeping genes
``` {r ruv4-HK-differentKs}
## Take treatment as the biological factor of interest
groups_A <- factor(info_datasetA$treatment) 
gA <- cbind(as.numeric(groups_A))   ## 1 control, 2: TGFb
## Take treatment as the biological factor of interest
groups_B <- factor(info_datasetB$treatment) 
gB <- cbind(as.numeric(groups_B))   ## 1 control, 2: TGFb

ks <- c(1, 2, 5, 6, 7, 8, 10, 11, 12, 15, 18, 20, 22, 23, 24)
## For k > 24 I got Error:
# NaNs producedNaNs producedError in sigmashrink(fit$sigma2, fit$df) : 
#   NA/NaN/Inf in foreign function call (arg 1)

beta_corAB_HK <- vector()
fit_ruv4_hk_datasetA_all_k=list()
fit_ruv4_hk_datasetA_all_k.summary=list()
fit_ruv4_hk_datasetB_all_k=list()
fit_ruv4_hk_datasetB_all_k.summary=list()
for (K in ks){
  fit_ruv4_hk_datasetA_all_k[[K]] <- RUV4(YA, X = gA, 
                            ctl = ctrl, 
                            k = K,Z = 1, eta = NULL, 
                            fullW0 = NULL, inputcheck = TRUE)
  
  fit_ruv4_hk_datasetA_all_k.summary[[K]] <- ruv_summary(YA,
                                           fit_ruv4_hk_datasetA_all_k[[K]],
                                           info_datasetA)
  
  fit_ruv4_hk_datasetB_all_k[[K]] <- RUV4(YB, X = gB, 
                            ctl = ctrl, 
                            k = K,Z = 1, eta = NULL, 
                            fullW0 = NULL, inputcheck = TRUE)
  
  fit_ruv4_hk_datasetB_all_k.summary[[K]] <- ruv_summary(YB,
                                           fit_ruv4_hk_datasetB_all_k[[K]],
                                           info_datasetB)
  
  currentCor <- cor.test(fit_ruv4_hk_datasetA_all_k[[K]]$betahat,
                        fit_ruv4_hk_datasetB_all_k[[K]]$betahat)$estimate
  
  beta_corAB_HK <- c(beta_corAB_HK, currentCor)
  
}
names(beta_corAB_HK) <- ks
data.frame(beta_corAB_HK) ## K = 23 seems a good choice
```

# Comparison of results of unadjusted and RUV4-adjusted data
In order to see if we have helped or not, we run differential expression analysis on the **unadjusted data**. Then we compare the results of RUV4 with different K values and unadjusted together.

## Unadjusted data
We can run RUV4 with k = 0 to do no adjustment when obtaining DEGs.

### Apply DE analysis in the unadjusted data using HK genes
```{r DE-unadjusted-A-B-using-HK}
# RUV4 with k = 0 for no adjustment
# Equivalent to a Limma Analysis without considering the batch term
##----- In dataset A data
fit_unadj_hk_datasetA <- RUV4(YA, X = gA, 
                          ctl = ctrl, 
                          k = 0)
fit_unadj_hk_datasetA.summary <- ruv_summary(YA, 
                                        fit_unadj_hk_datasetA,
                                        info_datasetA)

##----- In dataset B data
fit_unadj_hk_datasetB <- RUV4(YB, X = gB, 
                          ctl = ctrl,
                          k = 0)
fit_unadj_hk_datasetB.summary <- ruv_summary(YB, 
                                         fit_unadj_hk_datasetB,
                                         info_datasetB)

```

## P-values distribution
We compare the results obtained from unajusted, RUVinv- and RUV4- adjusted data using p-value distributions, correlations of beta values and venn diagrams in dataset A and B data.

```{r}
K=23
ruv_hist(fit_unadj_hk_datasetA.summary) + ggtitle("Unadj_A")
ruv_hist(fit_ruv4_hk_datasetA_all_k.summary[[K]]) + ggtitle("RUV4_HK_A")

ruv_hist(fit_unadj_hk_datasetB.summary) + ggtitle("Unadj_B")
ruv_hist(fit_ruv4_hk_datasetB_all_k.summary[[K]]) + ggtitle("RUV4_HK_B")
```

## Betahat correlation
For each of the unadjusted, RUVinv- and RUV4- adjusted settings, we can compare the results between dataset A and B data sets to see which method gives more consistent resulst in these two data sets. To quantify these consistencies, we look at the correlations between betahat from dataset A and betahat from dataset B for the unadjusted method, RUVinv and RUV4.

```{r cor-betahat-A-B-HK}
##------ Unadjusted data sets
plot(fit_unadj_hk_datasetA$betahat, 
     fit_unadj_hk_datasetB$betahat,
     xlab = "Betahat dataset A",
     ylab = "Betahat dataset B",
     main = "Unadjusted",
     xlim = c(-3,3), cex = 0.3, ylim = c(-4,4))
corVal <- cor.test(fit_unadj_hk_datasetA$betahat, fit_unadj_hk_datasetB$betahat)$estimate
text(-3,3, pos = 4, paste("Correlation: ", round(corVal,2), sep = ""))


for (K in ks){
   
  #------- RUV4 adjusted data sets
  plot(fit_ruv4_hk_datasetA_all_k[[K]]$betahat,
     fit_ruv4_hk_datasetB_all_k[[K]]$betahat,
     xlab = "Betahat dataset A",
     ylab = "Betahat dataset B",
     main = paste("RUV4_K_",K,sep=""),
     xlim = c(-3,3), cex = 0.3, ylim = c(-4,4))
  #abline(fit_ruv4_emp_datasetB$betahat,fit_ruv4_emp_datasetA$betahat)
  corVal <- cor.test(fit_ruv4_hk_datasetA_all_k[[K]]$betahat, fit_ruv4_hk_datasetB_all_k[[K]]$betahat)$estimate
  text(-3,3, pos=4, paste("Correlation: ", round(corVal,2),sep=""))
}

```

## Overlap between differentially expressed genes 
Finally, for each of the unadjusted, RUVinv and RUV4, we look at the number of overlapping differentially expressed genes (DEGs) between the two dataset A and B data sets. We also check for consistency between methods on the same data (dataset A or dataset B data set).\

First, we define DEGs as genes with adjusted p-value < 0.05 and |logFC| > 1 in dataset A and B data sets which are unadjusted, RUV4- or RUVinv-adjusted.

```{r}
K=23
##------ In dataset A data
DEGsUnadj_datasetA <- row.names(fit_unadj_hk_datasetA.summary$C)[fit_unadj_hk_datasetA.summary$C$F.p.BH < 0.05 & abs(fit_unadj_hk_datasetA.summary$C$b_X1) > 1]

DEGsRUV4_datasetA <- row.names(fit_ruv4_hk_datasetA_all_k.summary[[K]]$C)[fit_ruv4_hk_datasetA_all_k.summary[[K]]$C$F.p.BH < 0.05 & abs(fit_ruv4_hk_datasetA_all_k.summary[[K]]$C$b_X1) > 1]  

DEGsRUV4_datasetA_all_k=list()
for (K in ks){
  DEGsRUV4_datasetA_all_k[[K]]<- row.names(fit_ruv4_hk_datasetA_all_k.summary[[K]]$C)[fit_ruv4_hk_datasetA_all_k.summary[[K]]$C$F.p.BH < 0.05 & abs(fit_ruv4_hk_datasetA_all_k.summary[[K]]$C$b_X1) > 1]  
}

##------ In dataset B data:
DEGsUnadj_datasetB <- row.names(fit_unadj_hk_datasetB.summary$C)[fit_unadj_hk_datasetB.summary$C$F.p.BH < 0.05 & abs(fit_unadj_hk_datasetB.summary$C$b_X1) > 1]


DEGsRUV4_datasetB <- row.names(fit_ruv4_hk_datasetB_all_k.summary[[K]]$C)[fit_ruv4_hk_datasetB_all_k.summary[[K]]$C$F.p.BH < 0.05 & abs(fit_ruv4_hk_datasetB_all_k.summary[[K]]$C$b_X1) > 1]  

DEGsRUV4_datasetB_all_k=list()
for (K in ks){
  DEGsRUV4_datasetB_all_k[[K]]<- row.names(fit_ruv4_hk_datasetB_all_k.summary[[K]]$C)[fit_ruv4_hk_datasetB_all_k.summary[[K]]$C$F.p.BH < 0.05 & abs(fit_ruv4_hk_datasetB_all_k.summary[[K]]$C$b_X1) > 1]  
}
```

### DEGs in the unadjusted data
Venn diagram comparing the unadjusted dataset A and B data.
```{r Venn-DEGs-unadj}
allDEGs_unadj <- c(DEGsUnadj_datasetA, 
                    DEGsUnadj_datasetB)

## remove duplicated gene symbols:
allDEGs_unadj <- allDEGs_unadj[!duplicated(allDEGs_unadj)]  

## Draw a Venn diagram comparing DEGs for RUVinv
Counts_unadj <- matrix(0, nrow = length(allDEGs_unadj), ncol = 2)
row.names(Counts_unadj)<- allDEGs_unadj
colnames(Counts_unadj)<- c("Unadj_A","Unadj_B")

for( i in 1:length(allDEGs_unadj)) {
  Counts_unadj[i,1]<- allDEGs_unadj[i] %in% DEGsUnadj_datasetA
  Counts_unadj[i,2]<- allDEGs_unadj[i] %in% DEGsUnadj_datasetB
}

col <- c("blue", "violet")
vennDiagram(vennCounts(Counts_unadj), 
            circle.col = col, 
            cex = c(1.6, 1.2, 1), lwd=2)

```


### DEGs in the RUV4-adjusted data for different k
Venn diagram comparing the RUV4-adjusted dataset A and B data.
```{r Venn-DEGs-RUV4}
# 
# allDEGs_RUV4<- c(DEGsRUV4_datasetA, DEGsRUV4_datasetB)
# ## remove duplicated gene symbols:
# allDEGs_RUV4<- allDEGs_RUV4[!duplicated(allDEGs_RUV4)]  
# ## Draw a Venn diagram comparing DEGs for RUV4
# Counts_RUV4 <- matrix(0, nrow= length(allDEGs_RUV4), ncol=2)
# row.names(Counts_RUV4)<- allDEGs_RUV4
# colnames(Counts_RUV4)<- c("RUV4_A","RUV4_B")
# 
# for( i in 1:length(allDEGs_RUV4)) {
#   Counts_RUV4[i,1]<- allDEGs_RUV4[i] %in% DEGsRUV4_datasetA
#   Counts_RUV4[i,2]<- allDEGs_RUV4[i] %in% DEGsRUV4_datasetB
# }
# 
# col<- c("blue", "violet")
# vennDiagram(vennCounts(Counts_RUV4), 
#             circle.col = col, 
#             cex = c(1.6, 1.2, 1), lwd = 2)

allDEGs_RUV4_all_k=list()
for (K in ks){
  allDEGs_RUV4_all_k[[K]]=c(DEGsRUV4_datasetA_all_k[[K]],DEGsRUV4_datasetB_all_k[[K]])
  ## remove duplicated gene symbols:
  allDEGs_RUV4_all_k[[K]]<- allDEGs_RUV4_all_k[[K]][!duplicated(allDEGs_RUV4_all_k[[K]])]  
  ## Draw a Venn diagram comparing DEGs for RUV4
  Counts_RUV4<- matrix(0, nrow= length(allDEGs_RUV4_all_k[[K]]), ncol=2)
  row.names(Counts_RUV4)<- allDEGs_RUV4_all_k[[K]]
  colnames(Counts_RUV4)<- c(paste("RUV4_A_K_",K,sep=""),paste("RUV4_B_K_",K,sep=""))

  for( i in 1:length(allDEGs_RUV4_all_k[[K]])) {
    Counts_RUV4[i,1]<- allDEGs_RUV4_all_k[[K]][i] %in% DEGsRUV4_datasetA_all_k[[K]]
    Counts_RUV4[i,2]<- allDEGs_RUV4_all_k[[K]][i] %in% DEGsRUV4_datasetB_all_k[[K]]
  }

  col<- c("blue", "violet")
  vennDiagram(vennCounts(Counts_RUV4), 
            circle.col = col, 
            cex = c(1.6, 1.2, 1), lwd = 2)
}

```

### Compare the unadjusted and RUV4 in dataset A data
```{r Venn-DEGs-datasetA}
K=23
allDEGs_datasetA <- c(DEGsUnadj_datasetA,  DEGsRUV4_datasetA_all_k[[K]])

## remove duplicated gene symbols:
allDEGs_datasetA <- allDEGs_datasetA[!duplicated(allDEGs_datasetA)]  

## Draw a Venn diagram comparing DEGs for dataset A
Counts_datasetA <- matrix(0, nrow= length(allDEGs_datasetA), ncol=2)
row.names(Counts_datasetA)<- allDEGs_datasetA
colnames(Counts_datasetA)<- c("Unadj_A", "Ruv4_A_K_23")

for( i in 1:length(allDEGs_datasetA)) {
  Counts_datasetA[i,1]<- allDEGs_datasetA[i] %in% DEGsUnadj_datasetA
  Counts_datasetA[i,2]<- allDEGs_datasetA[i] %in% DEGsRUV4_datasetA_all_k[[K]]
}

col<- c("blue","darkgreen")
vennDiagram(vennCounts(Counts_datasetA), 
            circle.col=col, 
            cex=c(1.6, 1.2, 1), lwd=2)

```

### Compare the unadjusted and RUV4 in dataset B data
```{r Venn-DEGs-datasetB}
K=23
allDEGs_datasetB <- c(DEGsUnadj_datasetB,DEGsRUV4_datasetB_all_k[[K]])

## remove duplicated gene symbols:
allDEGs_datasetB <- allDEGs_datasetB[!duplicated(allDEGs_datasetB)] 

## Draw a Venn diagram comparing DEGs for dataset B
Counts_datasetB <- matrix(0, nrow = length(allDEGs_datasetB), ncol = 2)
row.names(Counts_datasetB) <- allDEGs_datasetB
colnames(Counts_datasetB) <- c("Unadj_B", "Ruv4_B_K_23")

for( i in 1:length(allDEGs_datasetB)) {
  Counts_datasetB[i,1]<- allDEGs_datasetB[i] %in% DEGsUnadj_datasetB
  Counts_datasetB[i,2]<- allDEGs_datasetB[i] %in% DEGsRUV4_datasetB_all_k[[K]]
}

col <- c("blue", "darkgreen")
vennDiagram(vennCounts(Counts_datasetB), 
            circle.col = col, 
            cex = c(1.6, 1.2, 1), lwd = 2)
```






